Processing single-cell RNA-seq datasets using SingCellaR

Summary Single-cell RNA sequencing has led to unprecedented levels of data complexity. Although several computational platforms are available, performing data analyses for multiple datasets remains a significant challenge. Here, we provide a comprehensive analytical protocol to interrogate multiple datasets on SingCellaR, an analysis package in R. This tool can be applied to general single-cell transcriptome analyses. We demonstrate steps for data analyses and visualization using bespoke pipelines, in conjunction with existing analysis tools to study human hematopoietic stem and progenitor cells. For complete details on the use and execution of this protocol, please refer to Roy et al. (2021).

different blood cell lineages can help exploration of the cell types contained within a dataset independently of cell clustering algorithms. To annotate cell types and states, SingCellaR uses gene set enrichment analysis to calculate enrichment scores for user-defined, curated gene sets. This system can be used alongside manual inspection for canonical marker genes, or other existing algorithms (Aran et al., 2019;Pliner et al., 2019;Shao et al., 2020;Zhang et al., 2019).
We recently demonstrated the utility of the SingCellaR pipeline in a study of human hematopoiesis. Studying the site-and stage-specific changes of normal hematopoiesis over human development is crucial to understanding the origin of disorders that tend to emerge at specific ages. We analyzed scRNA-seq datasets from hematopoietic stem and progenitor cells (HSPCs) from five different tissues sampled across four stages of the human lifetime (Roy et al., 2021). This study aimed to explore dynamic changes in the cellular compositions of lineage-specific HSPCs and identify the site and developmental stage-specific transitions in gene regulatory networks across human developmental stages.
Here we present a comprehensive protocol for the analytical pipeline demonstrating step-by-step data analysis and visualization as employed in the recent publication (Roy et al., 2021). We cover the software installation, processing of an individual sample and integrating multiple samples with benchmarking of multiple integration methods, cell clustering, cell type annotation, implementation of AUCell, differential abundance, and trajectory analyses. We also provide R functions for single-cell data visualization useful for interpreting results. Finally, limitations, problems, and potential solutions are discussed. All processed datasets, R objects, and codes are available and maintained on the published data repositories.

scRNA-seq datasets
Timing: 5-10 min In this protocol, we use scRNA-seq datasets of hematopoietic stem and progenitor cells (HSPCs) from human tissues across different developmental stages, including early fetal liver (eFL), matched fetal liver (FL) and bone marrow (FBM) isolated from the same fetuses, pediatric bone marrow (PBM), and adult bone marrow (ABM). Here, human fetal liver samples from first and second trimester are referred to eFL and FL respectively. Table S1 in (Roy et al., 2021) shows the summary of samples and single-cell quality information. Sample collection, fluorescence-activated cell sorting (FACS) strategies, high-throughput sequencing using 103 Genomics platform and data pre-processing steps are described in the method section of (Roy et al., 2021).
to facilitate R packages developed for biological data analysis. GitHub is a commercial repository that hosts services for individuals and teams for software version control and collaboration. The R functions 'install.packages', 'BiocManager::install', and 'devtools::install_github' will be used to install packages from CRAN, Bioconductor, and Github, respectively. Prior to installation, the function 'if(!require("package_name"))' will be used to check if the package has already been installed on a computer, and installation will proceed for packages not yet installed on the computer.
1. Install SingCellaR from GitHub by running the following R code: CRITICAL: We tested SingCellaR installation on macOS Mojave and Catalina, and Windows 10.
R version 4 or higher is required with other R packages, as shown in the key resources table. SingCellaR incorporates functions from Python (version 3.8) modules that have to be installed as described below. The timing of this step depends on the installation of devtools and Bioconductor packages. Devtools package is an R package to support the installation for other packages that are not yet in a standard package repository, such as CRAN or Bioconductor. R packages, including SingCellaR, usually depend on other R packages (dependencies) to support their full range of functionalities. Therefore, the time to complete the installation depends on the number of dependencies that have to be installed on the computer.
2. Install required python modules by running the following R code: a. Modules required for the force-directed graph analysis.
b. Module required for doublet removal using Scrublet. > library(reticulate) > conda_create("r-reticulate", python_version="3.8") > py_install("fa2", envname="r-reticulate") > py_install("networkx", envname="r-reticulate") > py_install("Scrublet",envname="r-reticulate") ll OPEN ACCESS CRITICAL: We tested the fa2 python module for the force-directed graph analysis using Python versions 2.7, 3.6 and 3.8. Using Conda environment is recommended in conjunction with reticulate package. Conda (https://docs.conda.io/en/latest/) is a virtual environment management system for Python. With Conda, the user can create, remove, and update environments that have different versions of Python packages installed in them. This flexibility is especially helpful on devices in which the user does not have administrative privileges to install Python packages on the device and compile version control for specific packages.
3. Install required R packages by running the following R code: a. harmony -required for data integration using Harmony method (Korsunsky et al., 2019).
c. doParallel and doRNG -required for parallel processing in AUCell analysis.
d. DAseq -required for the analysis of differential abundance (Zhao et al., 2021).
Note: The destiny package is not available for Bioconductor version 3.13. The user can install this package from GitHub.
g. scanorama -required for the data integration. The user can install scanorama using pip command line: h. Seurat CCA/RPCA -required for the data integration.
i. rliger -required for the data integration.
j. ComBat -required for the batch removal.
k. Limma -required for the batch removal. This step creates a SingCellaR object for each sample and performs quality control (QC) to identify cells that qualify for further downstream analyses. We demonstrate the selection process of high-quality cells using multiple QC plots, data normalization, and identifying highly variable genes. The process starts from reading in the input files for each sample generated directly from the cellranger pipeline. Here, we show an early fetal liver (eFL_1) dataset as an example.

Load SingCellaR package into R environment:
Note: './SingCellaR_objects' is a local folder for this analysis. The user can change the folder name to a suitable file path in a local computer or server.
5. Create SingCellaR object. SingCellaR object is an extension of the SingleCellExperiment (Amezquita et al., 2020) object for storing data generated from single-cell experiments. The R code below shows how to read in the input files for generating the SingCellaR object using the function 'load_matrices_from_cellranger'. The input files are from the output of the cellranger pipeline version 3.0.1. The folder eFL1 contains input files consisting of cell barcodes (barcodes.tsv.gz), gene features (features.tsv.gz), and matrix (matrix.mtx.gz) containing unique molecular identifier (UMI) counts for gene expression. The user can assign a unique identifier to the sample in this step.
Now the SingCellaR object -eFL_1 is created.
CRITICAL: The 'cellranger.version' parameter is required to be compatible with the cellranger pipeline output file. The cellranger output file 'features.tsv.gz' for gene features is generated from Cell Ranger version 3 and above, whereas version 2 of Cell Ranger creates the file 'genes.tsv.gz'.
6. Create cell metadata. Cell metadata will be created. Rows represent cells and columns represent variables. Variables computed in this step include the number of UMIs and detected genes per cell, and percentage of mitochondrial gene expression for each cell.
The following parameter is required: a. mito_genes_start_with: Gene names starting with 'MT-' are used as a set of mitochondrial genes for human and 'mt-' for mouse samples.
Note: The cell metadata can be accessed using the function 'get_cells_annotation(eFL_1)' or eFL_1@meta.data. The user can manually add additional information to the columns of the cell metadata.
7. Visualize QC matrices. QC matrices computed in step 6 can be explored using the plotting functions: a. Histogram ( Figure 1A).
c. Plot of the number of UMIs versus the number of detected genes per cell ( Figure 1C).
8. Annotate cell quality, identify expressed genes, and assign cell and gene status into metadata. After visualizing QC matrices in step 7, the user can specify filtering parameters using the observed number of UMIs and detected genes per cell, and percentage of mitochondrial gene expression. The function 'filter_cells_and_genes' assigns a new column named IsPassed that will be added into the cell metadata. Cells that pass QC will be annotated as TRUE. Expressed genes will be identified in this step and the column named IsExpress will be added into the gene metadata. Genes expressed above and below the user-defined threshold will be annotated as TRUE or FALSE. Although all original cells and genes are retained in the metadata and gene expression matrix in this step, these annotations will be used to subset cells and expressed genes for further downstream analyses. The number of cells passing QC thresholds will be shown on the R console after running the following code: > plot_cells_annotation(eFL_1,type="histogram") > process_cells_annotation(eFL_1, mito_genes_start_with="MT-") > plot_cells_annotation(eFL_1,type="boxplot") c. min_detected_genes: The lower threshold for the number of expressed genes, above which cells are annotated as high-quality. To be used in conjunction with max_detected_genes argument. Default value is 1,000. d. max_detected_genes: The upper threshold for the number of expressed genes, below which cells are annotated as high-quality. To be used in conjunction with min_detected_genes argument. Default value is 8,000. e. genes_with_expressing_cells: The lower threshold for the number of cells in which a gene is expressed (UMI >=1), above which, the gene will be annotated as expressed. Default value is 10. f. isRemovedDoublets: If set to TRUE (default), doublets will be removed prior to downstream analyses.
Note: Parameters used for filtering out cells with low quality can be varied across tissues and organs from different datasets. For example, the percentage threshold for mitochondrial gene expression, 'max_percent_mito' can be set higher for liver tissues (MacParland et al., 2018). The user can use the plotting functions as described above to explore different cutoffs for all samples.
Optional: In addition, we have incorporated the Scrublet package (Wolock et al., 2019) for doublet removal into the latest version of SingCellaR. The user can add the optional step below to detect and exclude doublets: 9. Normalize UMI counts. SingCellaR scales UMI counts by normalizing each library size to 10,000 or mean library size.
The following parameter is required: use.scaled.factor: When set to TRUE, the gene expression values will be multiplied by 10,000 (by default) and normalized against the library size (total UMI counts) of each cell. The user can change the scale factor value using the scale.factor parameter. If set to FALSE, the function will use the mean of library size across all cells as the scale factor.
CRITICAL: Using use.scaled.factor=TRUE is recommended for 103 genomics data. The user should consider specifying use.scaled.factor=FALSE for scRNA-seq data generated from plate-based protocols, e.g., Smart-seq2 (Picelli et al., 2014) and TARGET-Seq (Rodriguez-Meira et al., 2019). This is because the sequencing depth of scRNA-seq data generated from the plate-based protocols is in the higher ranges compared to 103 genomics data (e.g., from a hundred thousand to millions of reads per cell). Thus, using a scale factor of 10,000 is insufficient and leads to small gene expression values. The user should set use.scaled.factor=FALSE in order to use the mean library size from all single cells as the scale factor.
10. Identify highly variable genes. SingCellaR uses a gamma generalized linear model (GLM) as the fitting method for gene expression and coefficient of variation to identify highly variable genes.
> normalize_UMIs(eFL_1, use.scaled.factor = T) The GLM is suitable for modeling log-normal data from a sparse normalized gene expression matrix. A column named IsVarGenes is added to the gene metadata and genes identified as highly variable are annotated as TRUE, while all other genes are annotated as FALSE. The number of genes used to fit in the model and the number of identified variable genes will be shown after running the following code: The following parameters are required: a. mean_expr_cutoff: The mean normalized expression value, above which, the genes are identified as highly variable. Default value is 0.1. b. disp_zscore_cutoff: The dispersion of z-score, above which, the genes are identified as highly variable. Default value is 0.1.
Note: In this step, we used lower cut-off values to increase the number of detected highly variable genes per sample for the downstream analyses.
11. Save the R object for further analyses.
Pause point: The user can pause the analysis after pre-processing each sample.

Integrating biological replicates
Timing: 15-30 min for each group, 1-2 h for all stages This step integrates the individual R objects from pre-processed biological or technical replicates generated from step 12. Here, we illustrate the integration of two early fetal liver samples collected from two donors (Roy et al., 2021).
14. Integrate pre-processed biological replicates. The user will initialize the integrated SingCellaR class object 'SingCellaR_Int' and assign a unique identifier prior to merging two datasets using the 'preprocess_integration' function. 15. Annotate cell quality and expressed genes. Filtering process has been already performed separately for each sample (see step 8), therefore the filtering parameters for this step will be set to include all cells. From the filtering output below, all cells will be retained, after running the following code: 16. Normalize and scale UMI counts.
Note: See step 9 for details of required parameters.
17. Regress out confounding factors. The normalized and scaled gene expression values from step 16 will be adjusted by regressing out cell-to-cell variation in gene expression values due to confounding factors (e.g., batch and donor effect). To this end, SingCellaR implements the 'lmFit' function from the R package limma (Ritchie et al., 2015), SingCellaR provides the 'remove_ unwanted_confounders' wrapper function to regress out the unwanted source of variation.
Here, we will regress out the effects of UMI counts, percentage of mitochondrial gene expression, and donor. Adjusted gene expression will be used for further analyses.
The following parameter is required: a. residualModelFormulaStr: The formula format used to regress out confounding factors. The names of variables defined have to be the same as the column names of the cell metadata.

OPEN ACCESS
Note: The user can change the residualModelFormulaStr parameter and perform the following steps down to step 23 (UMAP analysis) to explore the effect on cell clustering by specifying different sets of confounding factors. For example, the user can set residualModel-FormulaStr = "UMI_count+percent_mito" to compare with residualModelFormulaStr = "UMI_count+percent_mito+sampleID" to explore the effect of sample.
18. Identify highly variable genes. The number of genes used for fitting the GLM model and the number of highly variable genes will be identified after running the following code: Note: See step 10 for details of required parameters.
19. Remove selected genes. Here, we remove mitochondrial and ribosomal genes from highly variable genes identified from step 18 to avoid the skewed effect of ribosomal and mitochondrial gene expression in downstream analyses. The number of genes that are excluded will be shown after running the following code: Note: The human.ribosomal-mitochondrial.genes.gmt file can be downloaded from Zenodo: https://doi.org/10.5281/zenodo.5879071 under the folder Human_genesets. 20. Visualize highly variable genes ( Figure 2A).

Perform principal component analysis (PCA).
To interpret the relationship across single cells, dimensionality reduction methods are required to reduce high dimensionality data to the visualizable two-or three-dimensional space. In PCA, the reduced dimensional space is represented by principal components. The top PCs will capture most of the variance of the dataset. Here, we perform linear dimensionality reduction using PCA and then bring forward the most informative PCs for further nonlinear dimensionality reduction to visualize cells in a two-dimensional space.
To this end, highly variable genes identified from steps 18-19 and visualized in step 20 will be used for PCA using the 'runPCA' function, a wrapper function for the 'irlba' function from the irlba package (http://bwlewis.github.io/irlba/). The following parameters are required: a. use.components: The number of principal components (PCs) to estimate. Default value is 50. b. use.regressout.data: If set to TRUE (default), the adjusted gene expression values from step 17 will be used.

22.
Visualize principal components. The elbow plot is used to determine the number of PCs to be included for further dimensionality reduction analyses ( Figure 2B). The number of PCs to specify downstream should correspond to the elbow point, used as a cut-off to include PCs that capture most of the biological variations found in the data.
23. Perform nonlinear dimensionality reduction analyses. Running nonlinear dimension reduction on all highly variable genes requires high computational resources and processing time. Hence, using identified PCs from step 22 for nonlinear dimension reduction analysis is a standard technique for scRNA-seq analysis. Nonlinear dimension reduction analysis is suitable for capturing cellular heterogeneity (Xiang et al., 2021). Here, we use Uniform Manifold Approximation and Projection (UMAP) (McInnes et al., 2018) by running the 'runUMAP' wrapper function that  Note: The user can also apply the t-Distributed Stochastic Neighbor Embedding (tSNE) approach using the 'runTSNE' function. The reduced dimension coordinates for UMAP and tSNE can be accessed using functions 'get_umap.result' and 'get_tsne.result' respectively.
24. Visualize cell lineages on low dimensional space. To explore whether the cells in each lineage are clustered in close proximity, we will visualize the UMAP result with the expression of multilineage gene sets using the 'plot_umap_label_by_multiple_gene_sets' function ( Figure 2C).
Here, we observe that cells are clustered together according to major cell types.
The following parameters are required: a. gmt.file: Path to the file containing the gene signatures in GMT format. b. show_gene_sets: The names of the gene signatures to plot on UMAP. c. custom_color: The color assignment to each signature specified in 'show_gene_sets'. d. isNormalisedByHouseKeeping: When set to TRUE (default), the gene expression values of the individual genes of each gene signature specified will be normalized by the housekeeping genes. The housekeeping genes are defined as the top 100 genes with the highest total gene expression values across all cells. e. point.size: Size of the data points on UMAP plot. Default value is 2.
The following parameters are required: a. n.dims.use: The number of PCs to use. Default number is 30. b. dim_reduction_method: The dimensionality reduction analysis name. c. n.neighbors: The number of neighboring cells. This number may be the same as specified in step 23 if UMAP is used. Default number is 30. d. knn.metric: The distance metric, 'euclidean' is used by default. Another option is 'cosine'.
Note: The cluster metadata for each cell can be accessed using the 'get_cluster' function.
26. Visualize clusters on low dimensional space. Louvain clusters can be shown on UMAP using the 'plot_umap_label_by_clusters' function ( Figure 2D).
The following parameters are required: a. show_method: The clustering detection name used as in step 25. b. mark.clusters: If set to TRUE (default) , cluster identifiers will be shown on the plot.
27. Identify cluster-specific genes. The user can identify marker genes, which are particularly expressed in each cluster using the 'findMarkerGenes' function. Differentially expressed gene analysis between one cluster against all other clusters is performed using the nonparametric Wilcoxon test on normalized expression values for the comparison of expression level and Fisher's exact test for the comparison of expressing cell frequency (Giustacchini et al., 2017). P-values generated from both tests are then combined using Fisher's method and are adjusted using Benjamini-Hochberg (BH). This process is iterated for each cluster against all other clusters, therefore the processing time in this step is dependent on the number of cells and clusters. By default, the minimum log2 fold change 'min.log2FC' parameter is set to 0.5 and the minimum fraction of expressing cells in each cluster 'min.expFraction' parameter is set to 0.3.
The following parameter is required: a. cluster.type: The clustering detection method used in step 25.
28. Save R object for further analyses.

Repeat the integration process for all biological replicates of FL, FBM, and PBM samples.
There is only one sample for ABM. Therefore, data integration is not required for this sample at this step. All R codes provided for each integration are available at Zenodo: https://doi. org/10.5281/zenodo.5879071.
Pause point: The user can pause the analysis after integrating the biological replicates for each developmental stage and save the results in multiple SingCellaR objects.
Integrating samples from all tissue types The aim of integrating all samples is to assess the existence of batch or donor-specific effects that are confounding factors contributing to differences in gene expression profile across samples. Examples of batch effect include differences in library preparation methods, sequencing batch, and donor or sample ID (Tran et al., 2020). If the batch effect is observed, it can be adjusted by a variety of existing batch correction and integration methods incorporated in SingCellaR, including a novel technique, namely Supervised Harmony that we have developed and described in (Roy et al., 2021). SingCellaR also implements the wrapper functions for other integration methods including Harmony (Korsunsky et al., 2019), Seurat (Hao et al., 2021), Liger (Gao et al., 2021), Scanorama (Hie et al., 2019), Combat (Johnson et al., 2007), and Limma (Ritchie et al., 2015). In this step, we demonstrate examples of how to use these methods for data integration using the wrapper functions. A strategy to benchmark how well single cells are clustered across covariate variables (e.g., batch and donor) is illustrated.
General examples of data integration include integrating samples from healthy donors and patients (Granja et al., 2019;Psaila et al., 2020) and from patients at different disease stages such as in cancer at diagnosis, remission, and relapse. Here, we illustrate the integration of HSPCs from five tissues spanning four different stages of human development that also happened to be processed and sequenced in two different batches (Roy et al., 2021).

Load SingCellaR package.
31. Initialize the SingCellaR_Int object and merge datasets generated from step 29. The user should observe that there are no cells being filtered out after running the following code: 33. Incorporate donor and sequencing batch information into cell metadata. This information is required to perform the batch correction. 39. Visualize principal components. Based on the elbow plot, the first 40 PCs will be used for data integration.
40. Integrate data using Supervised Harmony. We introduce Supervised Harmony, a method for data integration implemented in SingCellaR. Supervised Harmony can be performed using the 'runSupervised_Harmony' function. This method is an adaptation of Harmony method (Korsunsky et al., 2019). More details of this method were described (Roy et al., 2021). Here, sequencing batch and donor IDs as defined in step 33 are specified as covariates.
The following parameters are required: a. covariates: The name(s) of the covariate(s) specified as batch effect to be adjusted. The names should be the same as the column names of the cell metadata. b. n.dims.use: The number of PCs as determined from step 39 to be used in this step. c. hcl.height.cutoff: The cutree cut-off value for hierarchical clustering. Default value is 0.25. d. harmony.max.iter: The maximum number of rounds to run harmony. Default value is 10. e. n.seed: The random seed number generator. Default value is 1.
CRITICAL: Before running Supervised Harmony method, the 'findMarkerGenes' function must be performed for each developmental stage analysis (see step 27). The seed number (random number generator) and software version can vary across different devices. Hence, the user may notice variations in the rotation of the plots and clusters, which can be verified and visualized using lineage genes (see step 24).

Nonlinear dimension reduction analysis.
The UMAP analysis result from Supervised Harmony integration will be saved. This UMAP object will be used to compare with the results from other integrative methods (see steps below).
42. Integrate data using Harmony. SingCellaR also implements a wrapper function for Harmony integration method (Korsunsky et al., 2019). Harmony can be performed using the 'runHarmony' function. Here, sequencing batch and donor IDs as defined are specified as covariates. UMAP analysis will be performed to obtain the embedding.
The following parameters are required: a. covariates: The name(s) of the covariate(s) specified as batch effect to be adjusted. The names should be the same as the column names of the cell metadata. b. n.dims.use: The number of PCs as determined from step 39 to be used in this step. c. harmony.max.iter: The maximum number of rounds to run harmony. Default value is 10. d. n.seed: The random seed number generator. Default value is 1. The UMAP analysis result from Harmony integration will be saved. This UMAP object contains cell metadata and UMAP coordinates that will be used to compare with the results from other integrative methods.
43. Integrate data using Seurat. SingCellaR implements two wrapper functions for Seurat integration (Hao et al., 2021). First, the function 'runSeuratIntegration' is for the standard Seurat integration with Canonical Correlation Analysis (CCA). Second, the function 'runSeuratIntegration_-with_rpca' is for the fast integration using Reciprocal PCA (RPCA). More details about Seurat integration are described on Seurat's website (https://satijalab.org/seurat/). Due to the fast integration of using RPCA, in this protocol, we will demonstrate the function 'runSeuratIntegra-tion_with_rpca' as an example. However, the user should try Seurat CCA to make a comparison of the integrative results. The user can find how to use the function 'runSeuratIntegration' from SingCellaR's vignette. After the integration, the UMAP analysis from Seurat RPCA integration will be performed to obtain the embedding.
The following parameters are required: a. Seurat.metadata: The cell metadata. b. n.dims.use: The number of PCs as determined from step 39 to be used in this step. c. Seurat.split.by: The indicated feature name found in the cell metadata for splitting samples for integration. d. Use.SingCellaR.varGenes: If set to TRUE, the highly variable genes identified by SingCellaR will be used. If set to FALSE, the highly variable genes will be identified using Seurat. Default value is FALSE. Next, the UMAP analysis from Seurat RPCA integration will be performed and saved. This UMAP object will be used to compare with the results from other integrative methods. 44. Integrate data using Scanorama. SingCellaR implements a wrapper function for Scanorama integration (Hie et al., 2019). Scanorama can be performed using the 'runScanorama' function.
After the integration, the standard PCA and UMAP analyses from Scanorama integration will be performed to obtain the embedding.
45. Integrate data using Limma batch correction method. To perform Limma analysis (Ritchie et al., 2015), SingCellaR provides the 'remove_unwanted_confounders' wrapper function to regress out the unwanted source of variation. We illustrate below for regressing out the effect from the library size, donor, and batch. 46. Assign a cell type to single cells using the AUCell analysis. In this step, we will perform AUCell analysis (Aibar et al., 2017) with seven lineage signature genes including HSC/MPP, myeloid, lymphoid, erythroid, megakaryocytic, eosinophil/basophil/mast, and endothelial progenitor cells. This step will identify the cell types that can be used for benchmarking distinct integrative methods. We assume that cells with high AUCell scores, high expression of signature genes, indicate strong ground truth of the assigned cell type. Therefore, we would expect that the same cell type should be aggregated well together when applying integrative methods.
The following parameters are required: a. exprMat: The raw expression count matrix. This can be retrieved from the SingCellaR object using the 'get_umi_count' function. b. nCores: The number of cores to use for parallel processing. The maximum number of cores is dependent on the user's device. Default value is 1. c. plotStats: If set to TRUE (default), the expression statistics will be summarized and plotted in the histogram and boxplots.
Note: This step may be time-consuming. The user is advised to save the output of this step using the following code: Next, the AUCell analysis will be performed using the 'Run_AUCell' function.
The following parameters are required: d. AUCell_buildRankings.file: The input file name from the AUCell rankings. To explore the AUCell scores on UMAP plots, the user can run UMAP analysis using different types of integrative methods. This step is to identify the AUCell cut-off score for a particular cell type. The example below shows the 'plot_umap_label_by_AUCell_score' function that will be used to plot the myeloid AUCell scores ( Figure 3A).
Next, cells with high AUCell scores (e.g., > 0.2 or > 0.15) for each cell type will be assigned.

OPEN ACCESS
The user can explore the number of cells with high AUCell scores for each cell type using the function below and the Human_HSPC.CellType data frame object will be saved for use as the reference to perform benchmarking explained in the next step.
47. Benchmark distinct integrative methods using LISI and kBET methods. Next, we assess whether single cells with identified cell types derived from the AUCell analysis are clustered well across covariate variables (e.g., batch and donor). SingCellaR provides the wrapper functions for a Local Inverse Simpson's Index (LISI) (Korsunsky et al., 2019) and k-nearest-neighbor batch-effect test (kBET) (Buttner et al., 2019) to measure LISI and kBET scores across different integrative methods. First, the 'runLISI' function is performed as shown below. This function will plot LISI scores across different integrative methods ( Figure 3B). The following parameters are required: a. lisi_label1: The covariate variable name of interest such as batch or donor. Default value is donor. b. lisi_label2: The variable name that represents ground truth or high AUC score cell type.
Default value is CellType. c. reference.celltype.rds.file: The RDS file name that contains cell type information. d. integrative.umap.rds.files: The RDS file names that contain UMAP coordinate information generated by different integrative methods. e. integrative.method.names: The integrative method names that represent in the same order as in integrative.umap.rds.files. f. IsShowPlot: If set to TRUE (default), the iLISI and cLISI scores will be plotted. Second, the 'runKBET' function is performed as shown below. This function will calculate kBET scores across different integrative methods and return a data frame that can be used for plotting.
The following parameters are required: g. Covariate_variable_name: The covariate variable name of interest such as batch or donor.
Default value is donor. h. reference.celltype.rds.file: The RDS file name that contains cell type information. i. integrative.umap.rds.files: The RDS file names that contain UMAP coordinate information generated by different integrative methods. j. integrative.method.names: The integrative method names that represent in the same order as in integrative.umap.rds.files. k. n.sample: The downsample size of data points used in kBET analysis. Default value is 1,000.
Note: This step may be time-consuming, depending on the number of cells downsampled for kBET analysis. Next, kBET scores across integrative methods will be plotted using the 'ggplot' function (Figure 3C).
In this step, we illustrate how to benchmark integrative results generated from different methods using the wrapper functions for LISI and kBET analyses implemented in SingCellaR. We show the objective measurement of integration for each method using iLISI and cLISI scores ( Figure 3B) and kBET average acceptance rate score ( Figure 3C). The user can observe from the plots that Supervised Harmony shows higher kBET and iLISI scores, and cLISI score is close to 1, indicating better data integration for this HSPC dataset compared to other methods described in (Roy et al., 2021). Therefore, the integration result from Supervised Harmony will be used for further downstream analyses.
48. Visualize selected features and cell lineages on low dimensional space. Here, we will assess whether the data integration was performed successfully. To this end, we annotate the UMAP by sample ID, donor type, sequencing batch, and lineage signature genes. After running the following codes, we observe that cells are clustered by cell lineage, while sample ID, donor type and sequencing batch effects are successfully corrected. These indicate that data integration and batch correction was effective in eliminating batch effect, while enabling functionally related cell to be clustered in close proximity. a. Annotate UMAP plot by sample ID ( Figure 4A).
c. Annotate UMAP plot by sequencing batch ( Figure 4C). The following parameters are required: i. feature: The feature to annotate on UMAP plot. The feature name should match the column name of the cell metadata. ii. point.size: Size of the data points on UMAP. Default value is 1. iii. mark.feature: If set to TRUE (default), the feature name will be shown on the plot. d. Annotate UMAP plot by lineages genes ( Figure 4D). 50. Visualize clusters on low dimensional space. (Figure 5A). 51. Identify cluster-specific genes. This step will perform differential gene expression analysis to identify marker genes per each cluster.

Detect and assign clusters.
Note: See details in step 27. This step will take time to run on the fully integrated datasets depending on the number of cells and identified clusters.
52. Save the integrated R object for further analyses.
Pause point: The user can save the integrative SingCellaR_Int object for further downstream analyses.

Cell type annotation
Timing: 1.5-2 h The aim of cell type annotation is to assign a cell type identity to each cluster. The expression of selected marker genes can be visualized and explored across the different clusters, either using UMAP, dotplot, heatmap, or violin plots (Satija et al., 2015). Nevertheless, it is only feasible to explore a small number of gene markers using these approaches. Moreover, closely related cell populations with overlapping or subtle differences in transcriptomic profiles can be better resolved using gene sets, rather than individual genes. To more systematically and objectively annotate the clusters, SingCellaR implements a GSEA-based cell type annotation approach whereby cluster-specific ranked genes are subjected to GSEA analysis using a comprehensive list of curated gene sets. Using this approach, for each cluster, the list of genes previously included for differential expression analysis using findMarkerGenes is ranked from the most up-regulated to most down-regulated. Next, the individual genes in each curated gene set are assessed for enrichment in this ranked list of genes. Gene sets that are found higher up in the ranked list of genes are considered to be more enriched in a given cluster (Subramanian et al., 2005). This assessment of gene set enrichment is implemented by the fgsea (Korotkevich et al., 2021) package. These curated gene sets encompass 75 hematopoietic cell types and are available at Zenodo: https://doi.org/10.5281/zenodo.5879071 in GMT file format and are also available in Table S3 from the original publication (Roy et al., 2021).

Load the SingCellaR package.
54. Load the integrated R object generated from step 52. 55. Generate the pre-ranked genes. For each cluster, differential gene expression analysis is performed (Giustacchini et al., 2017) against all other clusters and the resulting genes are ranked based on their log2 fold change multiplied by -log10 of the adjusted P-value. This is to yield a more robust ranking compared to using the log2 fold change alone for the genes for GSEA. Here, the 'identifyGSEAPrerankedGenes_for_all_clusters' function is performed to provide a suitable format for GSEA analysis and includes all possible expressed genes above the lower cut-off parameters defined using the min.expFraction and min.log2FC arguments.
The following parameter is required: a. cluster.type: The clustering method name. b. fishers_exact_test: The cut-off p-value. Default value is 0.1. c. min.expFraction: The fraction of expressing cells, above which, the gene will be included for GSEA. Default value is 0.01. d. min.log2FC: The log2 fold change, above which, the gene will be included for GSEA. Default value is 0.1.

Note:
The processing time of this step depends on the number of cells and clusters. The user is advised to save the output of this step using the following code: 56. Perform GSEA. For each cluster, the ranked genes are subjected to GSEA to assess the enrichment for all curated hematopoietic gene sets.
The following parameters are required: a. GSEAPrerankedGenes_list: The object containing the ranked genes for each cluster generated from step 55. b. gmt.file: Curated gene sets in GMT file format.
Note: Here, we curated gene sets encompassing 75 hematopoietic signatures, but the user can also generate other customized gene sets in GMT file format as the input for GSEA. Each line of the GMT file represents one gene set. Specifically, the first column represents the name of the gene set, the second column represents the description of the gene set, and the third column onwards represents the genes that constitute the gene set, whereby each column represents one gene. The GMT file should be saved in tab-delimited format.
57. Visualize GSEA results. A heatmap is used to observe and compare enrichment scores of each gene set (rows) across all clusters (columns). This visualization allows the user to annotate a cell type identity and cell states to each cluster based on the degree of enrichment of the curated gene sets. ( Figure 5B). The following parameters are required: a. isApplyCutoff: If set to TRUE, only the normalized enrichment scores (NES) of gene sets with adjusted P-values below the user-defined values in 'adjusted_pval' argument will be displayed on the heatmap. Default is FALSE. b. use_pvalues_for_clustering: If set to TRUE (default), the -log10(P-values) will be used instead of NES to cluster rows and/or columns. c. show_NES_score: If set to TRUE (default), NES will be displayed on the heatmap. d. fontsize_row: The font size of the gene set names along the rows of the heatmap. Default value is 5. e. adjusted_pval: The value, below which, NES will be displayed on the heatmap. The default value is 0.25. f. show_only_NES_positive_score: If set to TRUE, only NES > 0 will be displayed on the heatmap. Default is FALSE. g. format.digits: The number of significant digits to be used for numeric display on the heatmap. Default value is 2. h. clustering_method: The clustering method for clustering the rows and/or columns. Default is "complete". i. clustering_distance_rows: The distance metric to use when clustering the rows. Default is "euclidean". j. clustering_distance_cols: The distance metric to use when clustering the columns. Default is "euclidean". k. show_text_for_ns: If set to TRUE (default), non-significant (ns) NES will be displayed on the heatmap.
58. Visualize selected canonical marker genes using UMAP. One or more individual genes can be plotted on UMAP using the 'plot_umap_label_by_genes' function ( Figure 6A). The following parameter is required: a. gene_list: A vector of one or more gene names to plot.
59. Visualize selected canonical marker genes using bubble plot ( Figure 6B). One or more individual genes can be plotted using the 'plot_bubble_for_genes_per_cluster' function. Each gene will be represented on each row of the output.

OPEN ACCESS
The following parameters are required: a. cluster.type: The clustering method name used to identify and assign the cell clusters. b. gene_list: A vector of one or more gene names to plot. c. show.percent: If set to TRUE, the percentage of expressing cells for respective genes in each cluster are displayed on the dotplot. Default is FALSE.
60. Visualize identified marker genes for each cluster using heatmap. One or more individual genes can be plotted using a heatmap with the 'plot_heatmap_for_marker_genes' function. Each gene is represented on each row of the output.
The following parameters are required: a. cluster.type: The name of the clustering method used to identify and assign the cell clusters. b. n.TopGenes: The number of top genes for each cluster to plot. Default value is 5. 61. Export top marker genes for each cluster. The top marker genes with statistical analysis results can be exported to the text file format.

Timing: 2-3 h
The user can calculate the enrichment of specific gene sets (e.g., lineage signature and cell cycle genes) assigned for an individual cell. We incorporated AUCell score analysis (Aibar et al., 2017) into SingCellaR to assign and define high-confident lineage-specific cells. This cell-level enrichment analysis enables us to validate our cell type annotation and check for the presence of heterogeneous cell populations within a given cluster. This can be further used to compare differential abundance of different cell lineages across developmental stages.
62. Load SingCellaR and required R packages. Here, the user can load the integrated R object saved from step 52.
63. Load the integrated R object generated from step 52.
64. Build AUCell gene rankings. The user will have to create the ranked gene list using the function 'AUCell_buildRankings' implemented in AUCell package.
The following parameters are required: a. exprMat: The raw expression count matrix. This can be retrieved from the SingCellaR object using the 'get_umi_count' function. b. nCores: The number of cores to use for parallel processing. The maximum number of cores is dependent on the user's device. Default value is 1. c. plotStats: If set to TRUE (default), the expression statistics will be summarized and plotted in the histogram and boxplots.
Note: This step may be time-consuming. The user is advised to save the output of this step using the following code: 65. Calculate AUCell scores. AUCell scores for each cell will be computed using the ranked genes from the previous step for the provided hematopoietic gene sets.
Optional: The user is advised to save the AUCell scores for further analysis.
66. Visualize AUCell scores. The user can visualize AUCell scores for a given gene signature on specific clusters on the UMAP embedding. Here, we will use HSC/MPP gene signature on cluster 1 as an example ( Figure 7A). We observe HSC/MPP gene signature to be uniformly enriched across majority of cells from this cluster. This is consistent with our cluster-level GSEA results that indicate cluster 1 is enriched for several HSC and MPP gene sets.
The following parameters are required: a. AUCell_gene_set_name: The name of the gene signature to plot. The signature name specified here must be the same as the gene signature name in the GMT file provided in step 65. b. AUCell_score: The R object created from computing AUCell scores in step 65. c. selected.limited.clusters: Cells in selected cluster IDs will be displayed with the AUCell scores. d. IsLimitedAUCscoreByClusters: If set to TRUE, the AUCell scores will only be displayed for selected clusters as specified using the 'selected.limited.clusters' argument. Default is FALSE. e. AUCell_cutoff: The AUCell score threshold, above which, the scores will be displayed. The higher the score threshold, the more stringent the threshold. Default is 0.
Note: AUCell cutoff score is arbitrary. To explore the AUCell cutoff score for each gene signature, the user can plot the score distribution using ggplot2 and manually explore the suitable threshold.
67. Differential abundance testing. We use DAseq (Zhao et al., 2021) to perform pairwise comparisons for cellular abundance in different tissues across developmental stages. We have incorporated the DAseq analysis into the 'run_DAseq_comparison' function. Here, we will show the differential abundance of cell populations between eFL and FL ( Figure 7B) as an example. Erythroid progenitors are enriched in FL, whereas lymphoid and myeloid progenitors are predominantly derived from BM samples. Marked differences are also observed for megakaryocytic progenitors whereby these cells are mainly from eFL. Figure 7. UMAP plots overlaid with computed scores from AUCell and differential abundance analyses (A) The UMAP plot shows AUCell scores of the cells calculated using the HSC/MPP signature genes. Cells with the AUCcell score greater than 0.15 within cluster 1 are highlighted with gradient colors from low (black) to high (red) scores.
(B) The UMAP plot shows the differential abundance scores from the logistic classifier prediction calculated from DAseq analysis of pairwise comparison between eFL vs. FL. Cells in red are predicted to be more abundant in eFL, whereas cells in blue indicate higher abundance in FL. Cells in gray do not have substantially different abundance score.

OPEN ACCESS
The following parameters are required: a. groupA and groupB: The group names for integrated samples (e.g., eFL, FL, FBM, and PBM) or an individual sample name. b. labels.1 and labels.2: The sample IDs of groupA and groupB, respectively. c. path: The folder path to save the output file. d. outputname: The output file name.

Trajectory analysis
Timing: 1-1.5 h From the annotation of cell types assigned in the cell type annotation steps, the user can further investigate the direction of cellular differentiation trajectories. The user can trace immature to more differentiated cell populations and understand the functional relationship between different cell populations in terms of cellular maturity. It is noteworthy that current trajectory analysis requires the user to identify the starting point (immature cell state), in this case, HSC/MPP. Hence, the cell annotations inferred from previous steps will be helpful to infer the primitive cell clusters. SingCellaR supports two approaches of trajectory analysis, namely force-directed graph (FDG) (Jacomy et al., 2014) and diffusion map (Haghverdi et al., 2015). We will also perform Monocle3 analysis (Trapnell et al., 2014), a pseudotime-based method (Haghverdi et al., 2015) to infer cellular trajectories.
69. Load the integrated R object generated from step 52.
70. Run force-directed graph analysis. The user can use the 'runFA2_ForceDirectedGraph' function to build the force-directed graph layout (embeddings). The layout can be annotated using various features, including cell lineage signature genes. Here, we use the Supervised harmony embeddings to generate force-directed graph layout. The following parameters are required: a. useIntegrativeEmbeddings: If set to TRUE, the data integration or batch correction embeddings will be used in conjunction with 'integrative_method' argument. Default is FALSE. b. integrative_method: The data integration or batch correction method name. c. knn.metric: The distance metric. d. n.dims.use: The number of PCs from 'integrative_method'. If 'useIntegrativeEmbedding' is set to FALSE, the PCA analysis result is used. Default value is 30. e. n.neighbors: The number of neighboring cells. f. n.seed: The random number generator. Default value is 1. g. fa2_n_iter: The number of iterations for analyzing the 'networkx' graph. Default value is 1,000. 71. Visualize trajectories by Louvain clustering ( Figure 8A).
The following parameter is required: a. show_method: The clustering method name.
The following parameters are required: a. gmt.file: Path to the file containing the gene signatures in GMT format. b. show_gene_sets: The vector of gene signature names to show on the plot. The names must be the same names as found in the 'gmt.file'. c. custom_color: The assigned colors for gene signatures in 'show_gene_set'. d. isNormalizedByHouseKeeping: When set to TRUE (default), the gene expression values of each gene signature specified will be normalized by the housekeeping genes. The housekeeping genes are defined as the top 100 genes with the highest total gene expression values across all cells. e. edge.size: The size of the edges connecting the nodes. Default value is 0.2. 74. Load the integrated R object generated from step 52. 75. Run diffusion map analysis. The user can use the 'runDiffusionMap' function to generate the diffusion map layout (embeddings). The layout can be annotated using various features, including cell lineage signature genes. We will use the Supervised harmony embeddings to generate the diffusion map layout.
The following parameters are required: a. useIntegrativeEmbeddings: If set to TRUE, the data integration or batch correction embeddings will be used in conjunction with 'integrative_method' argument. Default is FALSE. b. integrative_method: The data integration or batch correction method name. c. n.dims.use: The number of PCs from 'integrative_method'. If 'useIntegrativeEmbedding' is set to FALSE, the PCA result will be used. Default value is 30. d. n.seed: The random number generator. Default value is 1. 79. Load the integrated R object generated from step 52.

83.
Generate trajectory graph and order cells by pseudotime. To learn the cell differentiation trajectories, the user will use the 'learn_graph' function provided by Monocle3. By default, Monocle3 uses a 'self-defined' node to perform the pseudotime analysis. Thus, the user will need to define the root node, i.e., the most immature cluster. To identify the root node, the user can use the 'get_earliest_principal_node' function. Based on the previous analyses, the user can select 'cl1', the HSC/MPP cluster, as the starting point of the trajectory.
g. Extract gene expression from downsampled cells along the path from different developmental stages and pseudotime from 'newcds' object from step 83.

EXPECTED OUTCOMES
The step-by-step protocols describe an analysis pipeline used in a recent publication (Roy et al., 2021). Here, we introduce SingCellaR as a tool to facilitate various data analyses and visualization of scRNA-seq data. We expect the outcomes from the pipelines can help reduce data analysis complications, speed up, and generate robust results. We summarize the expected outputs described in Table 1.

LIMITATIONS
SingCellaR requires signature gene sets to perform the cell type annotation analysis. Thus, the user would have to compile and curate customized gene sets for the relevant system of interest. In this protocol, we provide 75 gene sets curated from previous studies relevant to hematopoiesis.

Key step
Step Output definition Output file; Figure SingCellaR installation 1-3 Following these steps, a user should understand the installation processes of R packages. The protocols can help users successfully install SingCellaR and its dependencies on personal computers or high-performance workstation/computing clusters. We describe the possible issues for installing required Python modules in the R environment in the troubleshooting section. At the end of the step, we expect the user to run the command library(SingCellaR) successfully in the R terminal.
Processing scRNA-seq for an individual sample 4-12 These steps are for the initial analysis of an individual sample. The output files consist of nine SingCellaR objects generated from each sample. These files are used for the downstream data integration. We show the process to select high-quality cells using multiple QC plots, and we perform data normalization and identification of highly variable genes. We expect the user to inspect the QC of cells when applied using distinct cut-off parameters. AUCell analysis 62-67 SingCellaR implements AUCell analysis to assign and define high-confident lineagespecific cells. We validate cell-type annotation and compare the differential abundance of different cell lineages across developmental stages using the wrapper function for DAseq analysis.

Figure 8
The table shows a short description of protocol steps, output description, and expected output files and figures.

OPEN ACCESS
SingCellaR still lacks 'automatic object transformation' to interact with other existing packages, such as Seurat. However, SingCellaR uses the SingleCellExperiment object, the standard object for storing single-cell experimental data in R. Therefore, the gene expression matrix and cell metadata can be extracted simply from the SingleCellExperiment object. This issue will be improved when SingCellaR is updated to the next version to support more interactions with other packages and ensure compatibility with relevant R packages incorporated in SingCellaR.

Problem 1
The user may encounter the following error when executing the function 'runFA2_ ForceDirectedGraph': Error in runFA2_ForceDirectedGraph(Human_HSPC,useIntegrativeEmbeddings = T, : The fa2 python module is not installed!. Please install using pip (pip install fa2) There are two potentially problems: the fa2 package is not installed; and the R environmental path for FA2 module in python is not found by R.
Potential solution FA2 installation.
The fa2 package can be installed as suggested in the SingCellaR installation section.
Conda environment is recommended by using 'conda_create' function after loading the reticulate package.
Note: The fa2 package is not compatible with python 3.9 or higher versions.
python version configuration.
The user can use the following code to configure the python path: Note: The user must change the python path as shown here to Conda-specific path found in the user's computer.
The user can refer to Python version configuration tutorial in the reticulate R package found in this URL: https://rstudio.github.io/reticulate/articles/versions.html.
### Set the python version into R environment.

Potential solution
The scanorama package can be installed as suggested in the SingCellaR installation section. The Python environment configuration can be found in Problem 1.
Problem 3 UMAP/FDG plots may show different rotations from this protocol. This is caused by different software versions for generating plots and the seed number setting.

Potential solution
This can be solved by setting a seed number (n.seed parameter) found in runUMAP and runFA2_ ForceDirectedGraph functions.

Problem 4
The user may encounter running time and memory issues when performing the AUCell_ buildRankings function for a large-scale dataset.

Potential solution
The user can use the alternative function named 'Build_AUCell_Rankings_Fast' to speed up running time and use less memory for ranking gene expression for each cell.

Problem 5
The kBET score is used to benchmark the integration results from different integrative methods. The user may encounter slightly different kBET scores from this protocol. This is due to the different seed number settings and the number of subsampling cells for kBET analysis. More running time and memory will be used if the user sets the high number of the downsample size in kBET analysis.

Potential solution
This issue can be solved by setting the seed number prior to running kBET using the 'set.seed' function. The user can subsample and fix the number of cells for the kBET analysis using the n.sample parameter described in the 'runKBET' function.

RESOURCE AVAILABILITY
Lead contact Further information and requests for resources and reagents should be directed to and will be fulfilled by the lead contact, Supat Thongjuea (supat.thongjuea@imm.ox.ac.uk).

Materials availability
This study did not generate new unique reagents.

Data and code availability
d SingCellaR open-source codes, cellranger pipeline outputs, R codes and pre-processed R objects for this protocol are available and maintained on GitHub and Zenodo listed in the key resources table.
d Any additional information required to reanalyze the data reported in this paper is available from the lead contact upon request.